Abstract

Repeat finding in strings has important applications in subfields such as computational biology. Surprisingly,
no prior work on repeat finding has considered constraints on the locality of repeats. In this paper, we
propose and study the problem of finding the longest repetitive substrings covering particular string positions.
We propose an O(n) time and space algorithm for finding the longest repeat covering every position of a
string of size n. Our work is optimal since reading and storing an input string of size n takes O(n)
time and space. Because any substring of a repeat is also a repeat, our solution to longest repeat queries
effectively provides a "stabbing" tool for practitioners for finding most of the repeats that cover particular
string positions.
1. Introduction

Finding repetitive structures and regularities in genomes and proteins is important, as these structures
play important roles in the biological functions of genomes and proteins [1]. It is well known that overall
about one-third of the whole human genome consists of repeated subsequences [1]; about 10-25% of all
known proteins have some form of repetitive structures [?]. In addition, a number of significant problems in
molecular sequence analysis can be reduced to repeat finding [1]. Another motivation for finding repeats is to
compress DNA sequences, which is known as one of the most challenging tasks in the data compression
field. DNA sequences consist only of symbols from {A, C, G, T} and therefore can be represented by two bits
per character. Standard compressors such as gzip and bzip usually use more than two bits per character
and therefore cannot reach good compression. Many modern genomic sequence data compression techniques
rely heavily on repeat finding in the sequences [?, ?].

The notions of maximal repeat and super maximal repeat [?] capture all the repeats of the whole
string in a space-efficient manner, but they do not track the locality of each repeat and thus cannot support
the finding of repeats that cover a particular string position. In this paper, we propose and study the problem
of finding the longest repetitive substrings covering any particular string position. Because any substring of
a repeat is also a repeat, the solution to longest repeat queries effectively provides a "stabbing" tool for
practitioners for finding most of the repeats that cover particular string positions.

In this paper, we propose an O(n) time and space algorithm that can find the leftmost longest repeat of
every string position. We view our solution as optimal in both time and space, because one has to spend
Ω(n) time and space to read and store the input string.
2. Preliminary

We consider a string S[1...n], where each character is drawn from an alphabet Σ = {1, 2, ..., σ}.
A substring S[i...j] of S represents S[i]S[i+1]...S[j] if 1 ≤ i ≤ j ≤ n, and is an empty string if i > j.
String S[i'...j'] is a proper substring of another string S[i...j] if i ≤ i' ≤ j' ≤ j and j' - i' < j - i. The
length of a non-empty substring S[i...j], denoted as |S[i...j]|, is j - i + 1. We define the length of an
empty string as zero. A prefix of S is a substring S[1...i] for some i, 1 ≤ i ≤ n. A proper prefix S[1...i]
is a prefix of S where i < n. A suffix of S is a substring S[i...n] for some i, 1 ≤ i ≤ n. A proper suffix
S[i...n] is a suffix of S where i > 1. We say the character S[i] occupies the string position i. We say the
substring S[i...j] covers the kth position of S, if i ≤ k ≤ j. For two strings A and B, we write A = B
(and say A is equal to B), if |A| = |B| and A[i] = B[i] for i = 1, 2, ..., |A|. We say A is lexicographically
smaller than B, denoted as A < B, if (1) A is a proper prefix of B, or (2) A[1] < B[1], or (3) there exists
an integer k > 1 such that A[i] = B[i] for all 1 ≤ i ≤ k - 1 but A[k] < B[k]. A substring S[i...j] of S is
unique, if there does not exist another substring S[i'...j'] of S, such that S[i...j] = S[i'...j'] but i ≠ i'.
A substring is a repeat if it is not unique. A character S[i] is a singleton, if it appears only once in S.

Definition 2.1. For a particular string position k ∈ {1, 2, ..., n}, the longest repeat (LR) covering
position k, denoted as LR_k, is a repeat substring S[i...j], such that: (1) i ≤ k ≤ j, and (2) there does not
exist another repeat substring S[i'...j'], such that i' ≤ k ≤ j' and j' - i' > j - i.

Definition 2.2. For a particular string position k ∈ {1, 2, ..., n}, the left-bounded longest repeat
(LLR) starting at position k, denoted as LLR_k, is a repeat S[k...j], such that either j = n or S[k...j+1]
is unique.

Obviously, for any string position k, if S[k] is not a singleton, both LR_k and LLR_k must exist, because
at least S[k] itself is a repeat. Further, there might be multiple choices for LR_k. For example, if S =
abcabcddbca, then LR_2 can be either S[1...3] = abc or S[2...4] = bca. However, if LLR_k does exist, it
must have only one choice, because k is a fixed string position and the length of LLR_k must be as long as
possible.
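The two definitions can be checked directly on small inputs. The following is a brute-force sketch only, not part of the proposed algorithms; the function names (`occurrences`, `is_repeat`, `brute_force_lrs`, `brute_force_llr`) are hypothetical, and positions are 1-based as in the text.

```python
def occurrences(s, t):
    """Number of (possibly overlapping) occurrences of t in s."""
    return sum(1 for p in range(len(s) - len(t) + 1) if s.startswith(t, p))


def is_repeat(s, i, j):
    """True if S[i...j] (1-based, inclusive) occurs at least twice in S."""
    return i <= j and occurrences(s, s[i - 1:j]) >= 2


def brute_force_lrs(s, k):
    """All longest repeats covering position k, as (start, length) tuples."""
    n = len(s)
    repeats = [(i, j - i + 1)
               for i in range(1, k + 1)
               for j in range(k, n + 1)
               if is_repeat(s, i, j)]
    if not repeats:
        return []                      # S[k] is a singleton
    longest = max(length for _, length in repeats)
    return [(i, length) for i, length in repeats if length == longest]


def brute_force_llr(s, k):
    """LLR_k as (start, length), or None if S[k] is a singleton."""
    j = k
    # Extend to the right while S[k...j] is still a repeat.
    while j <= len(s) and is_repeat(s, k, j):
        j += 1
    return (k, j - k) if j > k else None
```

For S = abcabcddbca and k = 2, `brute_force_lrs` reports both choices mentioned above, (1, 3) for abc and (2, 3) for bca, while `brute_force_llr` reports the single LLR (2, 3).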

The suffix array SA[1...n] of the string S is a permutation of {1, 2, ..., n}, such that for any i and
j, 1 ≤ i < j ≤ n, we have S[SA[i]...n] < S[SA[j]...n]. That is, SA[i] is the starting position of the ith
suffix in the sorted order of all the suffixes of S. The rank array Rank[1...n] is the inverse of the suffix
array. That is, Rank[i] = j iff SA[j] = i. The longest common prefix (lcp) array LCP[1...n + 1] is
an array of n + 1 integers, such that for i = 2, 3, ..., n, LCP[i] is the length of the lcp of the two suffixes
S[SA[i - 1]...n] and S[SA[i]...n]. We set LCP[1] = LCP[n + 1] = 0. In the literature, the lcp array is
often defined as an array of n integers. We include an extra zero at LCP[n + 1] only to simplify the
description of our upcoming algorithms. Table 1 in the appendix shows the suffix array and the lcp array
of the example string mississippi.
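As an illustration of these three arrays, the sketch below builds them naively by sorting the suffixes. This is an assumption made for brevity: it is not one of the linear-time constructions the paper relies on, and the name `build_arrays` is hypothetical. Index 0 of each returned list is an unused placeholder so that the indices match the 1-based notation of the text.

```python
def build_arrays(s):
    """Return (SA, Rank, LCP) for s, 1-based, with LCP[1] = LCP[n + 1] = 0.

    Naive construction (sorting whole suffixes); for illustration only.
    """
    n = len(s)
    # SA[r] is the starting position of the r-th smallest suffix.
    sa = [0] + sorted(range(1, n + 1), key=lambda p: s[p - 1:])
    # Rank is the inverse permutation of SA.
    rank = [0] * (n + 1)
    for r in range(1, n + 1):
        rank[sa[r]] = r
    # LCP[r] is the lcp length of the (r-1)-th and r-th smallest suffixes.
    lcp = [0] * (n + 2)
    for r in range(2, n + 1):
        a, b = sa[r - 1] - 1, sa[r] - 1      # 0-based starts
        h = 0
        while a + h < n and b + h < n and s[a + h] == s[b + h]:
            h += 1
        lcp[r] = h
    return sa, rank, lcp
```

Calling `build_arrays("mississippi")` reproduces the SA and LCP columns of Table 1 in the appendix.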

The next lemma (Lemma 2.1) shows that, by using the rank array and the lcp array of the string S, it is easy to
calculate any LLR_i if it exists or to detect the fact that it does not exist.

Lemma 2.1. For i = 1, 2, ..., n:

\[
LLR_i =
\begin{cases}
S[i \ldots i + L_i - 1], & \text{if } L_i > 0 \\
\text{does not exist}, & \text{if } L_i = 0
\end{cases}
\]

where L_i = max{LCP[Rank[i]], LCP[Rank[i] + 1]}.

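Proof. Note that L_i is the length of the longest lcp between the suffix S[i...n] and any other suffix of S,
because the suffixes sharing the longest prefix with S[i...n] are its neighbors in the suffix array. If L_i > 0,
it means the substring S[i ... i + L_i - 1] is that lcp, so it is a repeat that cannot be extended to the right
and remain a repeat. So S[i ... i + L_i - 1] is LLR_i. Otherwise (L_i = 0), the letter S[i] is a singleton, so
LLR_i does not exist. □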
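Lemma 2.1 translates into a one-line lookup per position. The sketch below assumes a naive helper `rank_and_lcp` (a hypothetical name, not a linear-time construction) that returns 1-based Rank and LCP arrays, and a hypothetical function `llr_table` that lists every LLR_i as a (start, length) tuple or None.

```python
def rank_and_lcp(s):
    """Naive 1-based Rank[1..n] and LCP[1..n+1]; index 0 is unused."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda p: s[p - 1:])
    rank = [0] * (n + 1)
    for r in range(1, n + 1):
        rank[sa[r]] = r
    lcp = [0] * (n + 2)
    for r in range(2, n + 1):
        a, b = s[sa[r - 1] - 1:], s[sa[r] - 1:]
        while lcp[r] < min(len(a), len(b)) and a[lcp[r]] == b[lcp[r]]:
            lcp[r] += 1
    return rank, lcp


def llr_table(s):
    """Map each position i to LLR_i = (i, L_i), or None when L_i = 0."""
    rank, lcp = rank_and_lcp(s)
    table = {}
    for i in range(1, len(s) + 1):
        # L_i = max{LCP[Rank[i]], LCP[Rank[i] + 1]}  (Lemma 2.1)
        length = max(lcp[rank[i]], lcp[rank[i] + 1])
        table[i] = (i, length) if length > 0 else None
    return table
```

For mississippi, position 1 (the singleton m) maps to None, and position 2 maps to (2, 4), the repeat issi.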
Algorithm 1: Find LR_k. Return the leftmost one if k has multiple LRs.

Input: The position index k, and the rank array and the lcp array of the string S
Output: LR_k, or (-1, 0) if no such LR exists. The leftmost one will be returned if k has multiple LRs.

1  start ← -1; length ← 0;                          // start position and length of LR_k
2  for i = k down to 1 do
3      L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};     // Length of LLR_i
4      if L = 0 or i + L - 1 < k then               // LLR_i does not exist or does not cover k.
5          break;                                   // Early stop
6      else if L ≥ length then                      // Tie is resolved by picking the leftmost one.
7          start ← i; length ← L;
8  Print LR_k ← (start, length);


3. Longest repeat finding for one position 

In this section, we want to find LR_k for a given string position k, using O(n) time and space. We present
the solution to this setting for practitioners who have only a small number of string positions
for which they want to find the longest repeats, so that this lightweight solution will suffice. We will
start with finding the leftmost LR_k if the string position k is covered by multiple LRs. At the end of the
section, we will show a trivial extension to find all LRs covering position k with the same time and space
complexities, if k has multiple LRs.

Lemma 3.1. Every LR is an LLR. 

Proof. Assume that LR_k = S[i...j] is not an LLR. Note that S[i...j] is a repeat starting from position i.
If S[i...j] is not an LLR, it means S[i...j] can be extended to some position j' > j, so that S[i...j'] is still
a repeat and also covers position k. That is, |S[i...j']| > |S[i...j]|. However, this contradicts the fact that
S[i...j] is already the longest repeat covering position k. □

Lemma 3.2. For any three string positions i, j, and k, 1 ≤ i < j ≤ k ≤ n: if LLR_j does not exist or does
not cover position k, LLR_i does not exist or does not cover position k either.

Proof. (1) If LLR_j does not exist, then S[j] is a singleton. If LLR_i does exist and covers position k, then
LLR_i also covers position j, which yields a contradiction: the substring LLR_i includes the singleton S[j]
but is a repeat. (2) If LLR_j = S[j...t] does exist but does not cover position k, then S[j...t + 1] is unique
and t + 1 ≤ k. If LLR_i exists and covers position k, say LLR_i = S[i...r], r ≥ k, it means S[j...t + 1] is
a substring of a repeat LLR_i = S[i...r], because i < j < t + 1 ≤ r, so S[j...t + 1] is also a repeat. This
contradicts the fact that S[j...t + 1] is unique. So LLR_i does not exist or does not cover position k. □

The idea behind the algorithm for finding the LR covering a given position is straightforward. Algorithm 1
shows the pseudocode, where the found LR is returned as a tuple (start, length), representing the starting
position and the length of the LR, respectively. If the LR that is being searched for does not exist, (-1, 0) is
returned by Algorithm 1. We know that any longest repeat covering position k must be an LLR (Lemma 3.1)
starting between indexes 1 and k inclusive. What we need to do is simply compute each of
LLR_1 ... LLR_k using Lemma 2.1 and check whether it covers position k or not. We will just choose the
longest LLR that covers position k and resolve ties by picking the leftmost one if k is covered by multiple
LRs (Line 6). Due to Lemma 3.2, a practical speedup is possible via an early stop (Line 5) by computing
and checking from LLR_k down to LLR_1 (Line 2).
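The scan described above can be sketched in a few lines. This is an illustrative rendering of Algorithm 1, not the authors' implementation: `rank_and_lcp` is a hypothetical naive helper (not linear time) and `find_lr` is a hypothetical name. Ties are resolved toward the leftmost candidate by accepting any LLR that is at least as long as the current best while scanning right to left.

```python
def rank_and_lcp(s):
    """Naive 1-based Rank[1..n] and LCP[1..n+1]; index 0 is unused."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda p: s[p - 1:])
    rank = [0] * (n + 1)
    for r in range(1, n + 1):
        rank[sa[r]] = r
    lcp = [0] * (n + 2)
    for r in range(2, n + 1):
        a, b = s[sa[r - 1] - 1:], s[sa[r] - 1:]
        while lcp[r] < min(len(a), len(b)) and a[lcp[r]] == b[lcp[r]]:
            lcp[r] += 1
    return rank, lcp


def find_lr(k, rank, lcp):
    """Leftmost longest repeat covering position k, or (-1, 0) if none."""
    start, length = -1, 0
    for i in range(k, 0, -1):
        llr_len = max(lcp[rank[i]], lcp[rank[i] + 1])   # length of LLR_i
        if llr_len == 0 or i + llr_len - 1 < k:
            break               # early stop justified by Lemma 3.2
        if llr_len >= length:   # ">=" keeps the leftmost among ties
            start, length = i, llr_len
    return start, length
```

With `rank, lcp = rank_and_lcp("abcabcddbca")`, the call `find_lr(2, rank, lcp)` returns (1, 3), the leftmost of the two candidates abc and bca.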

Lemma 3.3. Given the rank array and the lcp array of the string S, for any position k in the string S,
Algorithm 1 can find LR_k or the fact that it does not exist, using O(k) time and O(n) space. If there are
multiple candidates for LR_k, the leftmost one is returned.

Proof. The algorithm clearly has no more than k steps and each step takes O(1) time, so it costs a total
of O(k) time. The space cost is primarily from the rank array and the lcp array, which altogether is O(n),
assuming each integer in these arrays costs a constant number of bytes. □

Theorem 3.1. For any position k in the string S, we can find LR_k or the fact that it does not exist, using
O(n) time and space. If there are multiple candidates for LR_k, the leftmost one is returned.

Proof. The suffix array of S can be constructed by existing algorithms using O(n) time and space (for example,
[?]). After the suffix array is constructed, the rank array can be trivially created using O(n) time and space.
We can then use the suffix array and the rank array to construct the lcp array using another O(n) time and
space [?]. Given the rank array and the lcp array, the time cost of Algorithm 1 is O(k) (Lemma 3.3). So
altogether, we can find LR_k or the fact that it does not exist using O(n) time and space. If multiple LRs
cover position k, the leftmost LR will be returned, as is guaranteed by Line 6 of Algorithm 1. □

Extension: Find all LRs covering a given position. It is trivial to extend Algorithm 1 to find all the
LRs covering any given position k as follows. We can first use Algorithm 1 to find the leftmost LR_k. If LR_k
does exist, then we will start over again to recheck LLR_k down to LLR_1 and return those whose length is
equal to the length of LR_k. Due to Lemma 3.2, the same early stop as we have in Algorithm 1 can be used
for a practical speedup. The pseudocode of this procedure is provided in Algorithm 4 in the appendix, which
clearly costs an extra O(k) time. Combining this with Theorem 3.1, we have:

Theorem 3.2. We can find all the LRs covering any given position k using O(n) time and space.
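The two-pass extension can be sketched as follows. This is an illustrative rendering of the procedure above, under the same assumptions as before: `rank_and_lcp` is a hypothetical naive helper, `find_all_lrs` is a hypothetical name, and an empty list (rather than the tuple (-1, 0)) signals that no LR exists.

```python
def rank_and_lcp(s):
    """Naive 1-based Rank[1..n] and LCP[1..n+1]; index 0 is unused."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda p: s[p - 1:])
    rank = [0] * (n + 1)
    for r in range(1, n + 1):
        rank[sa[r]] = r
    lcp = [0] * (n + 2)
    for r in range(2, n + 1):
        a, b = s[sa[r - 1] - 1:], s[sa[r] - 1:]
        while lcp[r] < min(len(a), len(b)) and a[lcp[r]] == b[lcp[r]]:
            lcp[r] += 1
    return rank, lcp


def find_all_lrs(k, rank, lcp):
    """All longest repeats covering k as (start, length), sorted by start."""
    def covering_llrs():
        # Yield (i, |LLR_i|) for i = k down to 1 until the early stop.
        for i in range(k, 0, -1):
            llr_len = max(lcp[rank[i]], lcp[rank[i] + 1])
            if llr_len == 0 or i + llr_len - 1 < k:
                return
            yield i, llr_len

    # First pass: the length of LR_k (0 if it does not exist).
    best = max((length for _, length in covering_llrs()), default=0)
    # Second pass: report every covering LLR of that length.
    found = [(i, length) for i, length in covering_llrs() if length == best]
    return sorted(found)
```

For S = abcabcddbca and k = 2, this returns [(1, 3), (2, 3)].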


4. Longest repeat finding for every position 

In this section, we want to find LR_k of every position k = 1, 2, ..., n. If any position k is covered by
multiple LRs, the leftmost one will be returned. A natural solution is to iteratively use Algorithm 1 as
a subroutine to find every LR_k, for k = 1, 2, ..., n. However, the total time cost of this solution will be
$O(n) + \sum_{k=1}^{n} O(k) = O(n^2)$, where O(n) captures the time cost for the construction of the rank array and
the lcp array and $\sum_{k=1}^{n} O(k)$ is the total time cost for the n instances of Algorithm 1. We want to have a
solution that costs a total of O(n) time and space, which implies that the amortized cost for finding each
LR is O(1).

4.1. A conceptual algorithm

We will first calculate LLR_1, LLR_2, ..., LLR_n using Lemma 2.1 and save the results in an array LLRS[1...n].
Each LLR is represented by a tuple (start, length), the starting position and the length of the LLR. We
assign zero as the length of any non-existing LLR, which does not cover any string position. We then sort
the LLRS array in the descending order of the lengths of the LLRs, using a stable and linear-time sorting
procedure such as the counting sort.

Definition 4.1. After the LLRS array is stably sorted, let P_1 denote the string positions that are covered
by LLRS[1], and P_i, 2 ≤ i ≤ n, denote the string positions that are covered by LLRS[i] but are not covered
by any of LLRS[1...i - 1]. Let |P_i| denote the number of string positions belonging to P_i.

Note that any P_i, i > 1, can possibly be empty. Our conceptual algorithm will then assign LLRS[i] as
the LR of those string positions belonging to P_i, if P_i is not empty, for i = 1, 2, ..., n. We store the LRs that
we have calculated in an array LRS[1...n] of (start, length) tuples, where LRS[i] = LR_i and LRS[i].start
and LRS[i].length represent the starting position and length of LR_i. If LR_i does not exist, the tuple (-1, 0)
will be assigned to LRS[i], which can be done during the initialization of the LRS array. An early stop can be
made when (1) we meet an LLRS array element whose length is zero, which indicates that all the remaining
LLRS array elements also have lengths of zero; or (2) every string position has had its LR calculated.
Algorithm 2 shows the pseudocode of this conceptual algorithm.
Algorithm 2: The conceptual algorithm for finding the leftmost LR for every non-singleton string
position of S.

Input: The rank array and the lcp array of the string S
Output: The leftmost LR covering every non-singleton string position of S.

    /* Calculate the LLRS array using Lemma 2.1. Initialize the LRS array. */
1   for i = 1, 2, ..., n do
2       LLRS[i] ← (i, max{LCP[Rank[i]], LCP[Rank[i] + 1]});   // LLR_i, in the format of (start, length)
3       LRS[i] ← (-1, 0);                                      // LR_i, in the format of (start, length)
4   Stably sort LLRS[1...n] in the descending order of its second dimension;   // e.g.: counting sort.
    /* Find the leftmost LR for every position */
5   count ← 0;   // The number of non-singleton string positions that have their LRs calculated.
6   for i = 1, 2, ..., n do
7       if count = n or LLRS[i].length = 0 then break;        // Early stop
8       if |P_i| = 0 then continue;
9       foreach k ∈ P_i do LRS[k] ← LLRS[i];   // Calculate the LRs of the positions belonging to P_i.
10      count ← count + |P_i|;
11  return LRS[1...n]


Lemma 4.1. Algorithm 2 finds the LR for every position that does not contain a singleton. It finds the
leftmost LR if any position is covered by multiple LRs.

Proof. The proof of the lemma is straightforward. Recall that every LR must be an LLR (Lemma 3.1) and we
process all LLRs in descending order of their lengths. For i = 1, 2, ..., n, if P_i is not empty, then for each
position in P_i, the substring LLRS[i] is the longest LLR that covers that position, i.e., LLRS[i] is the LR
of that position. In the case where any position in P_i has multiple LRs, LLRS[i] must be the leftmost LR
because of the stable sorting of the LLRS array. □
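The conceptual algorithm can be sketched directly, before any of the speedups discussed next. The version below is an assumption-laden illustration: it finds each P_i by walking over the whole coverage of LLRS[i], so it is not linear time; `rank_and_lcp` is a hypothetical naive helper and `conceptual_lrs` is a hypothetical name. Python's built-in sort is stable, which stands in for the counting sort.

```python
def rank_and_lcp(s):
    """Naive 1-based Rank[1..n] and LCP[1..n+1]; index 0 is unused."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda p: s[p - 1:])
    rank = [0] * (n + 1)
    for r in range(1, n + 1):
        rank[sa[r]] = r
    lcp = [0] * (n + 2)
    for r in range(2, n + 1):
        a, b = s[sa[r - 1] - 1:], s[sa[r] - 1:]
        while lcp[r] < min(len(a), len(b)) and a[lcp[r]] == b[lcp[r]]:
            lcp[r] += 1
    return rank, lcp


def conceptual_lrs(s):
    """Leftmost LR of every position as (start, length); (-1, 0) if none."""
    n = len(s)
    rank, lcp = rank_and_lcp(s)
    llrs = [(i, max(lcp[rank[i]], lcp[rank[i] + 1])) for i in range(1, n + 1)]
    llrs.sort(key=lambda t: -t[1])        # stable, descending by length
    lrs = [(-1, 0)] * (n + 1)             # index 0 unused
    count = 0
    for start, length in llrs:
        if count == n or length == 0:
            break                         # early stop
        # P_i: covered positions that have no LR assigned yet.
        for k in range(start, start + length):
            if lrs[k] == (-1, 0):
                lrs[k] = (start, length)
                count += 1
    return lrs[1:]
```

For mississippi, positions 2 to 5 receive (2, 4), positions 6 to 8 receive (5, 4), and position 1 keeps (-1, 0) because m is a singleton.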

4.2. High-level strategy for a fast implementation

The challenge is to implement the conceptual algorithm (Algorithm 2) efficiently. Our goal is to use
O(n) time and space only, which is optimal since we have to spend O(n) time and space to report all the
LRs of all the n distinct string positions. We start with a property of each P_i (Definition 4.1). Recall
that, in Algorithm 2, we process all the LLRs in the descending order of their lengths, and also all LLRs
start from distinct string positions. Therefore, after the LLRS array is sorted (Line 4, Algorithm 2), none
of LLRS[1...i - 1] can be a substring of LLRS[i], for any i ≥ 2. This yields the following fact.

Fact 4.1. Every non-empty P_i, i ≥ 1, is a continuous chunk of string positions, i.e., every non-empty P_i is
an integer range [s_i, e_i], where s_i and e_i are the starting and ending string positions of P_i.

In the case where P_i is empty, we set s_i = e_i = -1. In order to achieve an overall O(n)-time
implementation of Algorithm 2, we need a mechanism that can quickly find s_i using O(1) time when processing
each LLRS[i]. Then, if s_i ≠ -1, due to Fact 4.1, we can just linearly walk from string position s_i through
the position e_i, which is either the right boundary of LLRS[i] or a string position whose next neighboring
position has had its LR calculated, whichever one is reached first. We will then set the LR of each visited
position during the walk to be LLRS[i], achieving an overall O(n) time implementation of Algorithm 2.

When we process a non-empty LLRS[i] and calculate its s_i, there are two cases. Case 1: The string
position LLRS[i].start has not had its LR calculated; then obviously s_i = LLRS[i].start. Case 2: The string
position LLRS[i].start has already had its LR calculated; then either s_i > LLRS[i].start (if P_i is not
empty) or s_i = -1 (if P_i is empty). In this case, it will not be efficient to find s_i by simply walking from
LLRS[i].start toward the right boundary of LLRS[i] until we reach that boundary or a string position whose
LR has not been calculated. It is not
immediately clear how to calculate s_i using O(1) time. This leads to the design of the following mechanism
that enables us to calculate every s_i in Case 2 using O(1) time.
4.3. The two-table system: the ptr and next arrays

Our mechanism is built upon two integer arrays, ptr[1...n] and next[1...n]. We update the two arrays
online when we process the sorted LLRS array elements in the calculation of the LR of every string position.
Ideally, we want to maintain these two arrays such that, for any string position k that has had its LR
calculated, next[ptr[k]] is either the next string position after k whose LR is not calculated yet, or n + 1
if no such string position after k exists. Then, when we process a particular non-empty LLRS[i], if the
string position LLRS[i].start has had its LR calculated, we can either directly get s_i or find the fact that all
string positions covered by LLRS[i] have had their LRs calculated, by comparing next[ptr[LLRS[i].start]]
and LLRS[i].start + LLRS[i].length - 1 (the right boundary of LLRS[i]). However, it is not clear how to
achieve such an ideal maintenance of the ptr and next arrays in a time-efficient manner. This motivates us
to maintain these two arrays approximately, that is, to maintain the following invariance. We will show
later that such approximate maintenance of the ptr and next arrays can still help calculate every s_i using
O(1) time.


4.3.1. Invariance.

We initialize every element of both ptr and next arrays to be -1. Recall that after the LLRS array is
sorted in descending order of the lengths of the LLRs, we process every LLRS[i], for i = 1, 2, ..., n. After we
have finished the processing of LLRS[1...i - 1], for any i ≥ 2, we want to maintain the following invariance
for the ptr and next arrays when processing LLRS[i].

1. If LLRS[i].start has already had its LR calculated but |P_i| > 0, then:

       next[ptr[LLRS[i].start]] = s_i

2. If |P_i| = 0 but LLRS[i].length > 0 (i.e., LLRS[i] is not empty), then next[ptr[LLRS[i].start]] is
larger than the index of the right boundary of LLRS[i]. That is,

       next[ptr[LLRS[i].start]] > LLRS[i].start + LLRS[i].length - 1

4.3.2. Using the invariance.

Recall that when we process a particular non-empty LLRS[i], we want to calculate s_i quickly. The hard
case is when the string position LLRS[i].start has already had its LR calculated. Provided with the above
invariance of the ptr and next arrays, when we process a non-empty LLRS[i], we will first check the value
of ptr[LLRS[i].start]. If it is not equal to -1, the hard case occurs. Then, if next[ptr[LLRS[i].start]] ≤
LLRS[i].start + LLRS[i].length - 1 (the right boundary of LLRS[i]), we can assert s_i = next[ptr[LLRS[i].start]];
otherwise, we can assert P_i is empty and thus will simply skip LLRS[i].

4.4. Maintaining the two-table system.

In the following, we will first describe how we update the ptr and next arrays when processing every
LLRS[i]. In the end, we will explain why the invariance is maintained using an overall O(n) time. Recall
that the whole algorithm will stop early if LLRS[i] is empty, so we will only need to describe the procedure
for processing a non-empty LLRS[i]. We first initialize every element in both ptr and next arrays to be -1.
We will use the word bucket to denote a maximal and continuous area in the ptr array where all entries share
the same positive value. So initially, there is no bucket present in the ptr array. Because all the LLRS
array elements have been sorted in the descending order of their lengths, the maintenance of the two-table
system will only have the following five cases to consider (Figure 1). We use left and right to denote the
indexes of the left and right boundary of LLRS[i]. That is,

    left ← LLRS[i].start;                            // the left boundary of LLRS[i]
    right ← LLRS[i].start + LLRS[i].length - 1;      // the right boundary of LLRS[i]
[Figure 1 panels: Case 1; Case 2.1, Case 2.2; Case 3.1, Case 3.2; Case 4; Case 5.1, Case 5.2, Case 5.3, Case 5.4.
Legend: positions whose LRs have been calculated; string positions that are covered by the LLRS[i] being processed.]

Figure 1: Possible cases regarding the relationship between the coverage of LLRS[i]
and other string positions whose LRs have been calculated

Case 1: The coverage of LLRS[i] does not connect to or overlap with any string positions whose LRs have
been calculated. We will create a bucket in the ptr array covering the string positions that are covered
by LLRS[i] and set up the corresponding next array entry to be the string position that is right after the
coverage of LLRS[i]. The following code shows the case condition and the update made to the ptr and next
arrays.

    if ptr[left] = -1 and ptr[right] = -1 and (left = 1 or ptr[left-1] = -1)
       and (right = n or ptr[right+1] = -1)                               // case 1
        for j = left ... right: ptr[j] ← i;
        next[i] ← right + 1;

Case 2: The coverage of LLRS[i] connects to or overlaps with the right side of a string position area whose
LRs have been calculated. We will extend that area's corresponding ptr array bucket to the coverage of
LLRS[i] and update the corresponding next array entry to be the string position that is right after the new
bucket.

    else if (ptr[left] = -1 and left ≠ 1 and ptr[left-1] ≠ -1) and ptr[right] = -1
       and (right = n or ptr[right+1] = -1)                               // case 2.1
        for j = left ... right: ptr[j] ← ptr[left-1];
        next[ptr[left-1]] ← right + 1;

    else if (ptr[left] ≠ -1) and ptr[right] = -1
       and (right = n or ptr[right+1] = -1)                               // case 2.2
        for j = next[ptr[left]] ... right: ptr[j] ← ptr[left];
        next[ptr[left]] ← right + 1;

Case 3: The coverage of LLRS[i] connects to or overlaps with the left side of an existing string position
area whose LRs have been calculated. We will left-extend that area's corresponding ptr array bucket to the
coverage of LLRS[i]. We need not update the corresponding next array entry, since the string position
that is right after the new ptr bucket does not change.

    else if (ptr[right] = -1 and right ≠ n and ptr[right+1] ≠ -1) and ptr[left] = -1
       and (left = 1 or ptr[left-1] = -1)                                 // case 3.1
        for j = left ... right: ptr[j] ← ptr[right+1];

    else if (ptr[right] ≠ -1) and ptr[left] = -1
       and (left = 1 or ptr[left-1] = -1)                                 // case 3.2
        j ← left;
        while ptr[j] = -1: ptr[j] ← ptr[right]; j++;

Case 4: Every string position covered by LLRS[i] has its LR calculated already. In this case, we simply
do nothing.

    else if ptr[left] ≠ -1 and next[ptr[left]] > right: do nothing;       // case 4

Case 5: The coverage of LLRS[i] bridges two string position areas whose LRs have been calculated. We will
extend the left area's corresponding ptr array bucket up to the left boundary of the right area and update
the next array entry of the left area to be the one of the right area.

    else
        if ptr[left] = -1: j ← left; ptr_entry ← ptr[left-1];             // case 5.1, 5.2
        else: j ← next[ptr[left]]; ptr_entry ← ptr[left];                 // case 5.3, 5.4
        while ptr[j] = -1: ptr[j] ← ptr_entry; j++;
        next[ptr_entry] ← next[ptr[j]];
Lemma 4.2 (Correctness). The two-table system's invariance is maintained.

Proof Sketch: Let us call a ptr array bucket ptr[i...j] a tail bucket if j = n or ptr[j + 1] = -1. (1) We
first prove that the invariance is maintained on tail buckets. Observe that any tail ptr bucket is created in
Case 1 (Figure 1) and can be extended in Case 2 as well as in Case 3 if the black bucket in Case 3 was also
a tail bucket. The update to the tail bucket as well as the corresponding next array entry guarantees that,
for any i belonging to the coverage of a tail bucket, next[ptr[i]] is ideally equal to the index of the string
position that is right after the bucket (or n + 1 if no such string position exists). So obviously the invariance
is maintained. (2) We now prove the invariance is also maintained on non-tail buckets. Observe that any
non-tail bucket is created in Case 5 from the merge of the left black bucket and the new LLRS[i]'s coverage.
After such a non-tail bucket is created, for any position i belonging to a non-tail bucket, next[ptr[i]] is at least
as large as the index of the string position that follows the right black bucket in Case 5. That means
next[ptr[i]] - i is larger than the size of any unprocessed LLRS array element. This guarantee is maintained,
because every next[ptr[i]] only monotonically increases. So, the invariance is also maintained for non-tail
buckets. (3) Because the invariance is well maintained for all ptr buckets, it is safe to have the condition
checking as we have written for Case 4. ■

Lemma 4.3 (Time complexity). The two-table system is maintained using a total of O(n) time over the
course of the processing of the LLRS array elements.

Proof Sketch: Observe that the updates to the ptr array are made only to those entries whose values were
-1 and the new values from the updates are all positive. So there are no more than n updates to the ptr
array. It is obvious that the number of updates made to the next array is no more than the number of
updates made to the ptr array. Other than ptr and next array updates, the rest of the maintenance work
for the two-table system when processing each LLRS array element takes O(1) time. So the total time cost
in maintaining the two-table system over the course of the processing of the whole LLRS array is O(n). ■

4.5. The final O(n) time and space algorithm.

By combining the conceptual Algorithm 2, the high-level strategy for the fast implementation, and the
two-table system's maintenance mechanism, we are ready to produce the final O(n) time and space algorithm
that can find the leftmost LR of every string position. Algorithm 3 shows the pseudocode. It starts with
the calculation of the LLRS array and the initialization of the LRS, ptr, and next arrays (Lines 1-4). It
then sorts the LLRS array in the descending order of the lengths of the array elements using a linear and
stable sorting procedure (Line 5). It then uses the for loop (Line 7) to process every LLRS array element
with a possible early stop (Line 8). Using the two-table system, the value of s_i is calculated by Line 10 if P_i
is not empty; otherwise, the fact that P_i is empty will be detected by Line 11. After s_i is calculated,
finding the LR of each position in P_i becomes straightforward (Lines 12-14). After the LR finding work is done, we
will update the two-table system (Line 15) using the code presented in Section 4.4.

Lemma 4.4. Given the lcp array and the rank array, Algorithm 3 calculates the leftmost LR of every
non-singleton position of a string S of size n using a total of O(n) time and space.

Proof Sketch: (1) Correctness. The correctness of Algorithm 3 immediately follows from Lemma 4.1 and
Lemma 4.2. (2) Space. The data structures involved are the LCP, Rank, LLRS, LRS, ptr, and next
arrays. Altogether they use O(n) space. (3) Time. The initialization (Lines 1-4) takes O(n) time, and
the stable sorting (Line 5) uses O(n) time. The rest of the work (Lines 7-15) also takes O(n) time, because
we update every LRS array element no more than once and the two-table system maintenance also takes
O(n) time (Lemma 4.3). So the total time cost is O(n). ■

Theorem 4.1. Given a string S of size n, we can calculate the leftmost LR of every string position using
O(n) time and space.


Algorithm 3: The O(n) time and space algorithm for finding the leftmost LR for every non-singleton
string position of S.

Input: The rank array and the lcp array of the string S
Output: The leftmost LR covering every non-singleton string position of S.

    /* Calculate the LLRS array using Lemma 2.1.
       Initialize the LRS array and the auxiliary ptr and next arrays. */
1   for i = 1, 2, ..., n do
2       LLRS[i] ← (i, max{LCP[Rank[i]], LCP[Rank[i] + 1]});   // LLR_i, in the format of (start, length)
3       LRS[i] ← (-1, 0);                                      // LR_i, in the format of (start, length)
4       ptr[i] ← -1; next[i] ← -1;
5   Stably sort LLRS[1...n] in the descending order of its second dimension;   // e.g.: counting sort.
    /* Find the leftmost LR for every position */
6   count ← 0;   // The number of non-singleton string positions that have their LRs calculated.
7   for i = 1, 2, ..., n do
8       if count = n or LLRS[i].length = 0 then break;        // Early stop
9       left ← LLRS[i].start; right ← LLRS[i].start + LLRS[i].length - 1;   // The boundaries of LLRS[i].
        /* first = s_i of P_i = [s_i, e_i] if P_i is not empty. */
10      if ptr[left] = -1 then first ← left; else first ← next[ptr[left]];
11      if first > right then continue;                       // Detect the fact that P_i is empty.
        /* Calculate the leftmost LR of every position in P_i = [s_i, e_i]. */
12      j ← first;
13      while j ≤ right and ptr[j] = -1 do
14          LRS[j] ← (LLRS[i].start, LLRS[i].length); count ← count + 1; j ← j + 1;
15      Update the two-table system here using the code presented in Section 4.4.
16  return LRS[1...n]


Proof Sketch (of Theorem 4.1): We can construct the suffix array of the string S in a total of O(n) time and space using
existing algorithms (for example, [?]). The rank array is just the inverse suffix array and can be directly obtained
from SA using O(n) time and space. Then we can obtain the lcp array from the suffix array and rank array
using another O(n) time and space [?]. So the total time and space costs for preparing the rank and lcp
arrays are O(n). The proof of the theorem then immediately follows from Lemma 4.4. ■
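Putting Sections 4.2 to 4.5 together, the whole procedure can be sketched as below. This is an illustrative rendering under stated assumptions, not the authors' implementation: `rank_and_lcp` is a hypothetical naive helper (so the sketch as a whole is not linear time, although the main loop follows the linear-time scheme), `update_tables` and `leftmost_longest_repeats` are hypothetical names, `nxt` stands for the next array, and ptr carries two extra sentinel entries (indices 0 and n + 1, always -1) to replace the boundary tests left = 1 and right = n.

```python
def rank_and_lcp(s):
    """Naive 1-based Rank[1..n] and LCP[1..n+1]; index 0 is unused."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda p: s[p - 1:])
    rank = [0] * (n + 1)
    for r in range(1, n + 1):
        rank[sa[r]] = r
    lcp = [0] * (n + 2)
    for r in range(2, n + 1):
        a, b = s[sa[r - 1] - 1:], s[sa[r] - 1:]
        while lcp[r] < min(len(a), len(b)) and a[lcp[r]] == b[lcp[r]]:
            lcp[r] += 1
    return rank, lcp


def update_tables(ptr, nxt, bid, left, right):
    """Case analysis of Section 4.4; bid is the id used for a new bucket."""
    l_free, r_free = ptr[left] == -1, ptr[right] == -1
    ln_free, rn_free = ptr[left - 1] == -1, ptr[right + 1] == -1
    if l_free and r_free and ln_free and rn_free:          # Case 1
        ptr[left:right + 1] = [bid] * (right - left + 1)
        nxt[bid] = right + 1
    elif l_free and not ln_free and r_free and rn_free:    # Case 2.1
        b = ptr[left - 1]
        ptr[left:right + 1] = [b] * (right - left + 1)
        nxt[b] = right + 1
    elif not l_free and r_free and rn_free:                # Case 2.2
        b = ptr[left]
        for j in range(nxt[b], right + 1):
            ptr[j] = b
        nxt[b] = right + 1
    elif l_free and ln_free and r_free and not rn_free:    # Case 3.1
        b = ptr[right + 1]
        ptr[left:right + 1] = [b] * (right - left + 1)
    elif l_free and ln_free and not r_free:                # Case 3.2
        b, j = ptr[right], left
        while ptr[j] == -1:
            ptr[j] = b
            j += 1
    elif not l_free and nxt[ptr[left]] > right:            # Case 4
        pass
    else:                                                  # Case 5
        if l_free:
            j, b = left, ptr[left - 1]
        else:
            b = ptr[left]
            j = nxt[b]
        while j <= right and ptr[j] == -1:
            ptr[j] = b
            j += 1
        nxt[b] = nxt[ptr[j]]


def leftmost_longest_repeats(s):
    """Leftmost LR of every position as (start, length); (-1, 0) if none."""
    n = len(s)
    rank, lcp = rank_and_lcp(s)
    llr = [0] + [max(lcp[rank[i]], lcp[rank[i] + 1]) for i in range(1, n + 1)]
    by_len = [[] for _ in range(n + 1)]        # stable counting sort
    for i in range(1, n + 1):
        by_len[llr[i]].append(i)
    order = [i for length in range(n, 0, -1) for i in by_len[length]]
    lrs = [(-1, 0)] * (n + 1)
    ptr = [-1] * (n + 2)                       # sentinels at 0 and n + 1
    nxt = [-1] * (n + 1)
    count = 0
    for bid, left in enumerate(order, 1):
        if count == n:
            break                              # early stop
        right = left + llr[left] - 1
        first = left if ptr[left] == -1 else nxt[ptr[left]]
        if first > right:
            continue                           # P_i is empty
        j = first
        while j <= right and ptr[j] == -1:
            lrs[j] = (left, llr[left])
            count += 1
            j += 1
        update_tables(ptr, nxt, bid, left, right)
    return lrs[1:]
```

Zero-length LLRs are dropped by the counting sort, which plays the role of the early stop on LLRS[i].length = 0. For mississippi the sketch returns (2, 4) for positions 2 to 5, (5, 4) for positions 6 to 8, length-one repeats for positions 9 to 11, and (-1, 0) for position 1.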


5. Conclusion 

In this paper, we proposed the problem of finding the longest repeats covering particular string positions,
motivated by its applications in subfields such as computational biology. We proposed optimal algorithms
for finding the (leftmost) longest repeat of every string position using a total of O(n) time and space, based
on a novel two-table system that we designed. We have implemented our algorithms. Future work can be
an experimental study of the implementation.
Appendix

     i    LCP[i]   SA[i]   suffixes
     1    0        11      i
     2    1        8       ippi
     3    1        5       issippi
     4    4        2       ississippi
     5    0        1       mississippi
     6    0        10      pi
     7    1        9       ppi
     8    0        7       sippi
     9    2        4       sissippi
    10    1        6       ssippi
    11    3        3       ssissippi
    12    0        -       -

Table 1: The suffix array and the lcp array of an example string S = mississippi.


Algorithm 4: Find all LRs that cover a given position k

Input: The position index k, and the rank array and the lcp array of the string S
Output: All LRs that cover position k, or (-1, 0) if no such LR exists.

    /* Find the length of LR_k. */
1   length ← 0;
2   for i = k down to 1 do
3       L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};      // Length of LLR_i
4       if L = 0 or i + L - 1 < k then                // LLR_i does not exist or does not cover k.
5           break;                                    // Early stop
6       else if L > length then
7           length ← L;
    /* Print all LRs that cover position k. */
8   if length > 0 then                                // LR_k does exist.
9       for i = k down to 1 do
10          L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};  // Length of LLR_i
11          if L = 0 or i + L - 1 < k then            // LLR_i does not exist or does not cover k.
12              break;                                // Early stop
13          else if L = length then
14              Print LR_k ← (i, length);
15  else Print LR_k ← (-1, 0);                        // LR_k does not exist.